import util
from IPython.display import HTML, Image, display
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Click to toggle on/off the raw code.">
</form>''')
util.Image('img/Yelp_Logo.png')
Small businesses tend to have it hard in the digital age. With the proliferation of social media and review platforms, there is an abundance of customer reviews that a typical business owner cannot handle. Which reviews should they focus on? And which ones should be ignored? How can a small business owner keep up with the influx of information while keeping the business afloat?
We intend to answer these questions through Latent Semantic Analysis (LSA). Through LSA, we hope to uncover the underlying characteristics of customer reviews. We used the Yelp website as our data source, as it has accumulated millions of reviews since its inception in 2004. However, our study focuses on reviews made in 2017 alone and excludes neutral reviews (those with a star rating of 3). As such, we use two subsets of the Yelp customer review data: positive (star rating above 3) and negative (star rating below 3).
Our methodology applied significant preprocessing, which entailed 1) removing punctuation and symbols, 2) removing stop words using the corpus from the NLTK library, and 3) lemmatizing using the same library. We then tokenized each row of text to create the Term Frequency-Inverse Document Frequency (TF-IDF) matrix, or document-term matrix, and employed LSA to decompose it into latent topics, which served as the basis for clustering each review.
As we clustered the reviews, we saw unique themes in both subsets. Positive reviews tend to focus on food and quality of service. In addition, we found a cluster focused solely on customers recommending the business and, curiously, a cluster solely for nail salons. Negative review clusters tend to mirror these themes for both restaurants and service-based businesses, though a unique cluster for car maintenance was found among the negative reviews.
Yelp is a crowd-sourced business review platform founded in 2004 by former PayPal employees Jeremy Stoppelman and Russel Simmons. It is a website and mobile app where users can discover, transact with, and most importantly, review local US businesses. With a net revenue of $257 million in the second quarter of 2021 and installs on over 31 million unique devices, it is the 4th largest review platform after Google, Facebook, and Amazon. The top business categories on Yelp are home and local services, restaurants, and retail shops, which together comprise more than 50% of listed businesses. [1][2]
With most of its revenue coming from advertisements, Yelp relies on its sheer volume of customer reviews to generate website and app traffic and, in turn, profit. Highly reliant on advertising revenue, it has acquired several businesses that complement its business model. In 2013, the platform acquired SeatMe, which enabled businesses to manage bookings and other front-office operations; it has since been rebranded as Yelp Reservations. Nowait, a similar application catering to restaurants, managed customer reservations; it was acquired by Yelp in 2017 and is now Yelp Waitlist. [3][4]
To further support local businesses, Yelp for Business was recently updated to provide increased transparency and insight to business owners’ respective Yelp Business Pages. The platform was designed to increase client revenue with features that optimize leads generation and recommendations to accelerate traffic in the client’s business page. However, it is unclear if it provides a digested account or feedback from customer reviews. [5]
Despite its expansion to new service segments, Yelp is still centered on propagating an ecosystem for its key target market – small and medium sized businesses. As such, customer reviews will remain a key component of its business model in the future. We hope to further extract additional insight and eventually enhance Yelp’s service to its clients.
In this study, we intend to understand the underlying groups (or clusters) of positive and negative Yelp reviews with the aid of Latent Semantic Analysis.
Our objective in undertaking this study is to create value for both the businesses present on the platform and their potential customers. First, we hope the study can shed light on the quality of the reviews. Once we identify the most relevant reviews, we may be able to give businesses quantifiable feedback. This could serve as a tool to address critical issues in their operations that may not be apparent in their typical feedback channels (e.g. monthly sales figures, quarterly management KPI reviews, etc.).
Second, we intend for the platform’s visitors to have a meaningful experience. Ideally, the most relevant and timely reviews would appear at the top. However, this is not always the case, as most review platforms simply place the most recent reviews first. By understanding Yelp’s reviews, we may be able to rank them for the customer.
Lastly, once we take a closer look at the behavior of the reviews, we may be able to apply algorithms that could be replicated to other categories of products. Currently, we are casting a broad net in order to see if there are general trends in all of Yelp’s reviews.
# data = util.import_data()
# util.separate_reviews(data) # saves DataFrames to csv as well
As Yelp is a crowd-sourced business review platform, it has a repository of reviews for businesses from several users. Although most will visit Yelp for reviews of restaurants, there are also reviews for other businesses such as building contractors or locksmiths. However, the Yelp review data does not distinguish these various business types despite the presence of user and business IDs.
For each row of data, there is text describing the review and a corresponding rating - from 1 Star as the lowest to 5 Stars as the highest. In addition, every review is also designated by other users as either useful, funny, or cool. However, for this study, we will only utilize the text data and rating scores for each row.
| Column Name | Data Type | Description |
|---|---|---|
| review_id | STRING | Identification tag for each review |
| customer_id | STRING | Identification tag for each customer |
| business_id | STRING | Identification tag for each business |
| stars | INT | Numerical score of review from user |
| date | STRING | Date when review was made by user |
| text | STRING | Full content of user / customer review for the business page |
| useful | INT | Number of users in platform that regard the review as useful |
| cool | INT | Number of users in platform that regard the review as cool |
| funny | INT | Number of users in platform that regard the review as funny |
The dataset contains 5,996,996 reviews. It would be operationally expensive and time-consuming to exhaust all reviews for this study. Thus, design limitations were set in place.
Extreme Ratings
The scope was limited to positive and negative reviews, defined by star ratings of 4 and 5, and 1 and 2, respectively. We believe neutral 3-star reviews offer little long-term business value: it is better to address the customer pain points found in negative reviews and to retain the behaviors found in positive ones. Removing 3-star ratings readily removes more than 10% of the dataset.
Latest Reviews
The scope also limits the reviews to 2017 data, the latest complete year since the 2018 data is incomplete. Using a full year ensures seasonal biases, such as more reviews during peak seasons in certain industries, can be ignored. In terms of business value, it is also best to use the latest reviews, as these would be most relevant to customers visiting Yelp; reviews from 2003, say, would not be as relevant in 2021 as reviews from 2017.
Applying these two limitations reduces the data to 809,314 positive and 278,366 negative entries, making it more manageable.
Latent Factors
Only 100 latent factors were used in determining topics. Although this explains less variance, it significantly reduces the runtime of singular value decomposition and clustering. Too many topics would also hinder interpretation. For latent semantic analysis, this value is the one recommended in the scikit-learn documentation [6].
Investigating the topics between positive and negative Yelp reviews entails the following steps: 1) extracting a subset of the data, 2) preprocessing the corpus, 3) creating a TF-IDF (document-term) matrix, 4) decomposing the TF-IDF matrix into topics using latent semantic analysis, 5) extracting the topic-encoded data, 6) clustering the data using the extracted latent factors, and 7) evaluating the topics and clusters.
util.Image('img/Diagram_Methodology_Lab4_v2.png')
We extracted our data from the following path:
/mnt/data/public/yelp/challenge12/yelp_dataset/yelp_academic_dataset_review.json
This is a subset of the entire Yelp reviews dataset; we focused on the year 2017 for our analysis. We excluded any 3-star reviews and centered our attention on the opposite ends of the spectrum – positive (more than 3 stars) and negative (less than 3 stars).
From here, we separate the dataset into positive and negative review subsets. We expect the underlying topics and, subsequently, clusters for each subset to be unique and should explain a large proportion of their respective variances.
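The extraction and split described above can be sketched as follows. The helper names and the chunked-reading approach are our own illustrative choices; the notebook's actual logic lives in `util.import_data` and `util.separate_reviews`.

```python
# Sketch of the extraction and splitting step (helper names and chunk
# size are illustrative, not the notebook's util code).
import pandas as pd

PATH = ('/mnt/data/public/yelp/challenge12/yelp_dataset/'
        'yelp_academic_dataset_review.json')

def filter_reviews(df):
    """Keep only 2017 reviews and drop neutral 3-star ratings."""
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    mask = (df['date'].dt.year == 2017) & (df['stars'] != 3)
    return df.loc[mask, ['stars', 'text']]

def split_by_polarity(df):
    """Split into positive (4-5 stars) and negative (1-2 stars) subsets."""
    return df[df['stars'] > 3], df[df['stars'] < 3]

def load_2017_reviews(path=PATH, chunksize=100_000):
    """Stream the JSON-lines file in chunks to keep memory bounded."""
    parts = [filter_reviews(chunk)
             for chunk in pd.read_json(path, lines=True, chunksize=chunksize)]
    return pd.concat(parts, ignore_index=True)

# df_pos, df_neg = split_by_polarity(load_2017_reviews())
```

Reading in chunks avoids loading all ~6 million reviews into memory at once before filtering down to the 2017 subset.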
# # preprocess and saves to pkl files
# docs_pos, docs_neg = util.preprocess_docs(df_pos, df_neg)
# # vectorize docs and save to pkl files
# util.tfidf()
display(util.pd.read_csv('positive_2017_reviews.csv',usecols=['text']).tail())
display(util.pd.read_csv('negative_2017_reviews.csv',usecols=['text']).tail())
Before creating the document-term matrix, we first preprocessed the data with the following steps: 1) removing punctuation and symbols, 2) removing stop words using the corpus from the NLTK library, and 3) lemmatizing using the same library.
This was done to each review in the positive and negative datasets to build the corpus passed to the vectorizer. The corpus is then saved.
with open('corpus_pos.pkl', 'rb') as fp:
display(util.pickle.load(fp)[:3])
with open('corpus_neg.pkl', 'rb') as fp:
display(util.pickle.load(fp)[:3])
Once preprocessing is done, we tokenize each row of text to create the Term Frequency-Inverse Document Frequency (TF-IDF) matrix, or document-term matrix. This representation suits clustering well: similarity between documents is easily computed, and the vectorizer gives less weight to words appearing in many documents. This is done for both positive and negative reviews, and the sparse matrix and vocabulary are saved for further use.
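The vectorization step can be sketched with scikit-learn's `TfidfVectorizer` on a toy corpus (the actual settings live in `util.tfidf`):

```python
# TF-IDF vectorization sketch: each row becomes a sparse vector whose
# weights downweight terms that appear in many documents.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'great pizza friendly staff',
    'terrible service long wait',
    'great service great food',
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse (n_docs, n_terms) matrix

# Terms shared across documents ('great', 'service') get lower IDF
# weight than terms unique to one document ('pizza', 'wait').
print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```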
Singular Value Decomposition (SVD) decomposes a given matrix into three representative matrices; applied to vectorized text data, it is known as Latent Semantic Analysis (LSA). We employed LSA to break down the TF-IDF matrix into 100 topics vis-à-vis the individual tokens and rows of text. SVD is well suited here since it readily handles sparse matrices and produces mutually orthogonal topics. These topics, rather than the sparse matrix directly, are the basis of our clustering.
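The decomposition can be sketched with `TruncatedSVD`, which operates directly on the sparse TF-IDF matrix. Two components are used here only so the toy example runs; the study used 100.

```python
# LSA sketch: truncated SVD on a TF-IDF matrix yields topic-encoded rows.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'great pizza friendly staff',
    'terrible service long wait',
    'great service great food',
    'amazing pizza great place',
]
tfidf = TfidfVectorizer().fit_transform(corpus)

svd = TruncatedSVD(n_components=2, random_state=1337)
lsa = svd.fit_transform(tfidf)   # topic scores per review, shape (4, 2)
print(lsa.shape)                 # (4, 2)
```

Each row of `lsa` is a review expressed in topic space, and `svd.components_` maps each topic back to the vocabulary for interpretation.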
After decomposing the TF-IDF matrix, we first determine which type of clustering is best. k-means is chosen as the primary method because it is the least computationally expensive given the size of the dataset. It also has a good range of internal validation procedures for evaluating the proper number of clusters; with a large number of topics to consider, these checks added more confidence in determining the clusters.
Once the clustering algorithm was chosen, we then chose the number of appropriate clusters to best represent the data using the internal validation criteria for positive and negative reviews. After determining the number of clusters, we moved on to evaluate each cluster.
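The clustering step itself is a straightforward k-means fit on the topic-encoded reviews. Here random noise stands in for the 100 topic scores; the seed matches the one used in the notebook.

```python
# k-means on LSA output (synthetic stand-in for the topic-encoded reviews).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1337)
lsa = rng.normal(size=(500, 100))      # stand-in for 500 reviews x 100 topics

kmeans = KMeans(n_clusters=5, random_state=1337, n_init=10)
labels = kmeans.fit_predict(lsa)       # cluster assignment per review
```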
Lastly, we interpreted each cluster by creating word clouds from its reviews. Word clouds scale each word by its count in the corpus, giving a quick visual interpretation and bypassing the need to evaluate individual topics.
# util.plot_year_dist(data)
util.Image('img/Yearly_Rating_Distribution.png')
The dataset runs from Yelp’s inception up to 2018, with review counts shown per year (not cumulatively). There is an increasing trend in the number of reviews per year. To provide more relevant reviews to customers, it is best to use the most recently gathered data.
Since the 2018 data does not span a full year, the next best choice is the 2017 data. Ideally one would use several years of data, such as the whole decade, but given the large volume of the dataset and the computational cost of running vectorizers, the 2017 data suffices.
# util.plot_ratings_dist(data)
util.Image('img/Ratings_Distribution.png')
The distribution of customer reviews is skewed towards positive with around 66.3% of reviews having 4 or 5 Stars. Due to the dataset being imbalanced, performing Latent Semantic Analysis would be best done by separating negative and positive reviews. In this way, the TF-IDF metric will not be biased towards words in negative reviews given that the Inverse Document Frequency is measured by the scarcity in documents.
In this study, 4 and 5 star ratings are grouped to form positive reviews, and 1 and 2 star ratings are grouped to form negative reviews. 3 star ratings are removed as they are deemed to be neutral and may not provide as much information as the extremes. Dropping 3 star ratings also reduces our data and saves on computational capacity.
util.Image('Yelp_Ratings.JPG')
Based on the publicly available image of Yelp’s statistics, the distribution of our dataset roughly follows this trend, suggesting the sample is a reasonable representation of the whole. A goodness-of-fit test could further verify that the sample distribution follows the distribution released by Yelp.
Upon performing Latent Semantic Analysis on each group, we obtain the cumulative variance explained by the singular vectors. For the positive reviews, 458 components are needed to explain at least 90% of the variance; for the negative reviews, 609 components are needed.
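These component counts come from cumulating `explained_variance_ratio_` of the fitted decomposition and locating the 90% threshold; a small helper (ours, not the notebook's) makes the computation explicit:

```python
# Find the smallest number of leading components whose cumulative
# explained-variance ratio reaches a threshold.
import numpy as np

def n_components_for(var_ratios, threshold=0.9):
    """Index of the 90% (by default) cumulative-variance crossing, 1-based."""
    cum = np.cumsum(var_ratios)
    return int(np.searchsorted(cum, threshold) + 1)

# With a fitted decomposition, e.g. svd = TruncatedSVD(...).fit(tfidf):
# n_components_for(svd.explained_variance_ratio_)
```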
# svd_pos, lsa_pos = util.lsa_100_pos()
# util.plot_var_explained(svd_pos)
util.Image('img/Var_Exp_pos.jpg')
# svd_neg, lsa_neg = util.lsa_100_neg()
# util.plot_var_explained(svd_neg)
util.Image('img/Var_Exp_neg.png')
We see that the variance of the negative reviews is spread more thinly across components, so more components are needed. The difference can be attributed to the size difference between the positive and negative groups: with everything else held constant, a smaller dataset has less variation than a larger one.
It is also possible that the lower variance can be attributed to the removal of stop words and contractions as we expect negative reviews to have more negative adverbs.
The scatter plots below show the customer reviews plotted against the first and second singular vectors. The LSA decomposition shows that the reviews roughly form triangular-shaped clusters fanning out from the origin along the axis of the second singular vector. This behavior is shared by both positive and negative customer reviews, but is more distinct for the negative reviews. We might see even more distinct behavior in both subsets if additional data points were included.
# util.plot_lsa(lsa_pos)
util.Image('img/reviews_pos.png')
# util.plot_lsa(lsa_neg)
util.Image('img/reviews_neg.png')
Although the rule of thumb in selecting the number of clusters is to identify the elbow in the SSE plot, no clear elbow was apparent in either the negative or positive reviews. The CH index peaked at k=2 and declined as we increased k. As such, we relied on the remaining two measures to find the optimal k.
The Silhouette coefficient and gap statistic plots behaved more promisingly as we increased k. Although still not as evident as we would like, both measures spiked at k=5, suggesting the optimal number of clusters. In focusing on these measures, we hoped to find clusters that were 1) highly defined by their latent factors and 2) clearly delineated by their separation.
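A sweep over candidate k values computing internal validation measures can be sketched as below, using synthetic data with a planted five-cluster structure in place of the topic-encoded reviews (the notebook's own sweep is `util.cluster_range`):

```python
# Internal validation sweep: fit k-means for k = 2..10 and score each
# partition with the silhouette coefficient and the CH index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Synthetic stand-in data: five well-separated Gaussian blobs.
X = np.vstack([rng.normal(loc=3 * c, scale=0.5, size=(100, 2))
               for c in range(5)])

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=1337, n_init=10).fit_predict(X)
    scores[k] = (silhouette_score(X, labels),
                 calinski_harabasz_score(X, labels))

best_k = max(scores, key=lambda k: scores[k][0])   # k maximizing silhouette
```

On real review data the peaks are far less pronounced than on these toy blobs, which is why the study cross-checked several measures rather than trusting any single one.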
# res_pos = util.cluster_range(lsa_pos, KMeans(random_state=1337), 2, 11)
# util.plot_clusters(lsa_pos, res_pos)
util.Image('img/k_clusters_pos.png')
# util.plot_internal(res_pos['inertias'], res_pos['chs'], res_pos['scs'],
# res_pos['gss'], res_pos['gssds']);
util.Image('img/internal_validation_pos.png')
# res_neg = util.cluster_range(lsa_neg, KMeans(random_state=1337), 2, 11)
# util.plot_clusters(lsa_neg, res_neg)
util.Image('img/k_clusters_neg.png')
# util.plot_internal(res_neg['inertias'], res_neg['chs'], res_neg['scs'],
#                    res_neg['gss'], res_neg['gssds']);
util.Image('img/internal_validation_neg.png')
Plotting the positive reviews dataset again after applying five clusters, we see the image below.
# util.plot_clustered(lsa_pos, kmeans_pos)
util.Image('img/clustered_pos.png')
Next we interpret using the word clouds for each of the clusters in the positive reviews set.
# util.word_cloud(5, lsa_pos, 'corpus_pos.pkl')
util.Image('img/cluster_pos_1.png')
The first cluster is evidently about pizza restaurants. Words such as “great”, “food”, and “place” describe a pizzeria that provides good-quality food and service, with a physical restaurant that is itself appealing. Given that this is a prominent cluster, there appear to be many pizzerias registered on the website, and they generally tend to generate good reviews. Most of the words are positive adjectives. The cluster also includes “family” and “experience”, which suggests pizzerias tend to be places of good experiences, specifically for families.
util.Image('img/cluster_pos_2.png')
The second cluster is about restaurants in general, since no single type of food dominates. “Spicy” is also a popular flavor mentioned in good reviews. “Always” seems unique to this cluster, which may suggest these reviews come from customers who regularly return or for whom service has been consistently good. Positive reviews can thus be an expression of customer loyalty born of positive experience.
util.Image('img/cluster_pos_3.png')
The third cluster pertains to general service-based businesses, given the absence of words related to flavor, food, or restaurants. We see several words that may be objects of praise, such as “service”, “time”, “place”, and “staff”. “Thank” is unique to this cluster, which may indicate the positive review is addressed to a representative or staff member. This can be a valuable cluster to analyze: from the distribution of businesses on the Yelp website, restaurants account for only about half of the businesses on the site [1].
util.Image('img/cluster_pos_4.png')
The fourth cluster focuses on reviews specifically for nail salons, as verified by the words “nail”, “salon”, and “pedicure”. In this cluster, the service is greatly praised with several positive adjectives (i.e. “friendly”, “great”, “amazing”, “love”). This suggests that nail salons, like pizzerias, are a major business on the platform that generates positive reviews. It is a very service-oriented business, and such reviews are probably how these salons advertise and form connections with customers.
util.Image('img/cluster_pos_5.png')
The last cluster revolves around recommendations, focused on service-based businesses. Words such as “professional”, “staff”, “work”, “friendly”, and “experience” describe the quality of the service rendered to the customer. We can also see that “highly recommended” is a commonly used phrase, as the two words appear at similar sizes. This cluster represents reviews that serve as a call to action for other customers.
Plotting the negative reviews dataset again after applying five clusters, we see the image below. Below it are the word clouds for each of the clusters in the negative reviews set and their interpretations.
# util.plot_clustered(lsa_neg, kmeans_neg)
util.Image('img/clustered_neg.png')
# util.word_cloud(5, lsa_neg, 'corpus_neg.pkl')
util.Image('img/cluster_neg_1.png')
For the first cluster, the reviews revolve around restaurant businesses. Words such as “taste”, “menu”, and “price” are prominent. In these reviews, there could be complaints on the quality of the food or the price of food at the restaurant.
util.Image('img/cluster_neg_3.png')
The second cluster looks into food service specifically. We see words such as “minute”, “server”, “service”, “time” and “waitress”. These relate specifically to the service of the food business, possibly the wait times or how the staff were acting in the restaurant.
util.Image('img/cluster_neg_4.png')
The third cluster has two dominant words, “one” and “time” which may suggest complaints on general customer service. This cluster does not revolve around a single particular business. These may suggest service commitments that were not met by the business. This is a possible pain point and will be touched on further.
util.Image('img/cluster_neg_2.png')
In this cluster, “car” is the dominant word; we believe it focuses on car repair and maintenance. Based on the corpus, it can be inferred that a service representative may have promised the customer their vehicle back within a certain time, reflected in the cloud as “back” and “time”. Because these are negative reviews, the obligation to finish repairs and return the vehicle was likely not met. This is a possible customer pain point and will be discussed further.
util.Image('img/cluster_neg_5.png')
The last cluster highlights complaints in the nail and hair salon industry, inferred from words such as “nail”, “hair”, and “salon”. As a service-oriented industry, these businesses rely heavily on good reviews, so this is a possible point of improvement.
Between the positive and negative review datasets, we see similarities in the clusters formed. Customer service, food quality, and place of business appear in both and should always be priorities, especially for businesses that rely on reviews.
We can also see that increasing the number of clusters results in clustering reviews by business category on Yelp, such as food, cars, and salons. This can be attributed to similar businesses being critiqued on the same features.
Based on the clustering of the positive reviews dataset, businesses seeking good reviews should focus on customer service, food quality, and even ambience. A great experience encourages positive reviews and even provides a call to action for other customers. These customer touch points should be cared for to maximize their value in retaining old customers and attracting new ones.
Pizzerias are also popular destinations for a good experience, so general recommendations on the app could skew towards them. This may be because Yelp reviews cover US businesses, and pizza is a popular food among Americans.
Based on the clustering of negative reviews dataset, it should be noted as well that customer service is the main customer pain point that drives them to leave a negative review. To avoid this, businesses should invest in better customer service such as training or even automated solutions for easier access.
It is also important to note that the biggest driver of bad reviews is failure to honor service commitments. If a business promises to complete a service at a certain time, it should commit to it; otherwise, customers are more likely to post a negative review. This applies to food service, such as wait times to order, and to car repair, where a specific time frame is expected. Again, since Yelp covers US businesses, pain points regarding cars would be numerous, as cars are the main mode of transport there.
The study may be used as a tool to address critical issues in business operations that may not be apparent in typical feedback channels (e.g. monthly sales figures, quarterly management KPI reviews, etc.). As such, further applications of the study could provide more reliable feedback to businesses.
It would be possible as well for Yelp to give this information to its businesses which would help them in customer touch points, thus boosting website traffic for Yelp.
In addition, understanding the underlying components of meaningful customer reviews should provide a more transparent platform for site visitors or app users. We expect results of the study could improve user experience. This should in turn add incentive for customers to post their own reviews – either to support or counter the current store of reviews.
A key weakness of our study is that we cast a very broad net over the data subset. Although we limited our analysis to a single year and bisected it into positive and negative reviews, we did not further focus on a single category. As Yelp caters to a wide variety of businesses, the characteristics of customer reviews for restaurants may be totally different from those for a house contractor or nail salon. This may explain why hundreds of components are needed to reach 90% of the explained variance for positive reviews.
Ideally, clustering would use all the components needed to reach at least 90% of the variance explained by LSA. A further improvement would be to use all positive and negative reviews. From there, more exploratory analysis could be done, such as statistical testing of clusters and extracting the top words of the highest-variance topics in each cluster, giving more interpretability to the LSA topics. Drilling into specific topics would yield more insights into customer reviews.
For future studies, it is recommended to check for trends in topics over time. Clustering the vectorized corpus with cosine similarity as the distance measure could also be explored, since this measure works well for text data. It would also be worth grouping Yelp businesses by the types of reviews they receive, such as those found when increasing the number of clusters, rather than by a fixed category list.
Yelp, Inc. (2021, June 30). An Introduction to Yelp Metrics as of June 30, 2021. www.yelp-press.com. https://www.yelp-press.com/company/fast-facts/default.aspx
Vendasta. (n.d.). Two big reasons why reviews on review sites are kind of a big deal. www.vendasta.com. https://www.vendasta.com/blog/top-10-customer-review-websites/
Wright, D. (2018, November 1). Managing Impromptu Diners Just Got Easier with Two New Yelp Waitlist Features. www.yelp.com. https://blog.yelp.com/2018/11/managing-impromptu-diners-just-got-easier-with-two-new-yelp-nowait-features
Yelp, Inc. (2014, May 31). Yelp Reservations: New Free Tool for Businesses Gives Diners Even More Options When Booking Online. www.yelp.com. https://blog.yelp.com/2014/05/yelp-reservations-new-free-tool-for-businesses-gives-diners-even-more-options-when-booking-online
Shiran, A. (2020, May 4). Yelp Launches an All New Modernized and Enhanced Experience for Business Owners. www.yelp.com. https://blog.yelp.com/2020/05/yelp-launches-an-all-new-modernized-and-enhanced-experience-for-business-owners
Scikit-learn. (n.d). sklearn.decomposition.TruncatedSVD. www.scikit-learn.org. https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html